Business Context. You are a research analyst at a large investment bank. Your team specializes in analyzing the airline industry and predicting future revenues and costs so that it can recommend investment strategies to clients. Having just completed data science training, you are eager to apply your newly acquired skills to the problem. Your team already has a lot of data in this domain, but you think that there is additional data that can be used to develop better predictions. Your available data is current up to January 2020, and your goal is to predict the next quarterly earnings, coming out in February 2020.
Business Problem. Your task is to build a model to predict the future revenues of United Airlines.
Analytical Context. As part of a large financial services company, your team has already collected and is using the following data: airline revenue data, airline fuel cost data, and oil price data.
You have also been looking for additional data to enhance your model. After considering many different sources, you are primarily interested in airline passenger data from the Bureau of Transportation Statistics and Twitter activity data.
The case will proceed as follows: you will (1) look at your team's current data and assess its deficiencies; (2) investigate alternative data sources; (3) scrape the data from these sources and perform data cleaning, EDA, and feature engineering; and finally (4) create a predictive model.
## Load relevant packages
import time
import matplotlib as mpl
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import seaborn as sns
from datetime import datetime
from geopy.geocoders import Nominatim
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Before diving into any data science project, our first step should always be to assess the current data to understand which pieces of information we could be missing. In some cases, you will have no data and will have to start from scratch. Here, we already have 3 different data sources, so we should look at each one individually and all of them as a whole to figure out how exactly we should supplement them. At every stage, we should keep in mind our objective: predicting future revenues. That means we should be asking the following question: "What pieces of information would be useful for predicting airline revenue?"
Let's first look at the airline revenue data. We'll import the file and visualize the data:
airline_revenues = pd.read_csv('./airline_revenues.csv')
airline_revenues
Here are the features of this data:
We are only interested in United because we want to show that this predictive model works for one airline as a proof of concept before building a more general model. As you are new to the team, you are not sure what the code for United is, so you look up the relevant documentation and see that it is UA. Since we are only interested in revenues, let's simplify this DataFrame:
# only get United revenues
united_revenues = airline_revenues[airline_revenues['UNIQUE_CARRIER'] == 'UA']
# only keep relevant columns
columns_to_keep = ['YEAR', 'MONTH', 'DAY', 'REVENUE']
united_revenues = united_revenues[columns_to_keep]
united_revenues
Now that we have the relevant information, let's plot the data to see what it looks like. Your team uses the following nifty function to plot time series data:
def plot_time_series(dates, values, title, x_label, y_label):
    """
    dates: must be a datetime series for the x axis
    values: the y axis values to plot
    title: string that goes above the plot
    x_label: string that goes on the x-axis
    y_label: string that goes on the y-axis
    """
    years_locator = mdates.YearLocator()
    months_locator = mdates.MonthLocator()
    years_format = mdates.DateFormatter('%Y')
    sns.set_style('ticks')
    fig, ax = plt.subplots()
    fig.set_size_inches(8, 6)
    ax.ticklabel_format(axis='y', style='plain')
    ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
    ax.xaxis.set_major_locator(years_locator)
    ax.xaxis.set_major_formatter(years_format)
    ax.xaxis.set_minor_locator(months_locator)
    sns.lineplot(x=dates, y=values, ci=None)
    rotation = 45
    plt.setp(ax.get_xticklabels(), rotation=rotation)
    plt.xlabel(x_label, fontsize='16')
    plt.ylabel(y_label, fontsize='16')
    plt.title(title, fontsize='18')
    plt.show()
united_revenues['DATE'] = united_revenues.apply(lambda r: datetime(int(r['YEAR']), int(r['MONTH']), int(r['DAY'])), axis=1)
plot_time_series(united_revenues['DATE'], united_revenues['REVENUE'], 'Revenue Over Time for United', 'Date', 'Revenue ($ Millions)')
Are there any interesting patterns you notice in this data?
Answer. Some things we notice are seasonal cyclical behavior, diminished demand in January, and a spike in 2011 caused by the merger of two airlines.
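One quick way to confirm this kind of seasonality is to average the series by calendar month. Here is a standalone sketch on synthetic monthly data (the numbers and the summer-peak shape are illustrative assumptions, not United's actual figures):

```python
import numpy as np
import pandas as pd

# synthetic monthly "revenue" with a July peak and a January trough
rng = pd.date_range('2012-01-01', '2019-12-01', freq='MS')
revenue = 100 + 20 * np.sin(2 * np.pi * (rng.month - 4) / 12)
df = pd.DataFrame({'DATE': rng, 'REVENUE': revenue})

# averaging by calendar month exposes the seasonal cycle
monthly_avg = df.groupby(df['DATE'].dt.month)['REVENUE'].mean()
print(monthly_avg.idxmax(), monthly_avg.idxmin())  # peak month 7, trough month 1
```

On the real united_revenues DataFrame, the same groupby on the month would reveal which periods consistently run high or low.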
Repeat the above analysis for airline_fuel_cost.csv. You will see many variables; for the sake of the exercise, we are only interested in QUARTER and TOTAL_COST.
Answer. One possible solution is given below:
airline_costs = pd.read_csv('./airline_fuel_costs.csv')
airline_costs
# only get United costs
united_costs = airline_costs[airline_costs['UNIQUE_CARRIER'] == 'UA']
# only keep relevant columns
columns_to_keep = ['YEAR', 'QUARTER', 'MONTH', 'TOTAL_COST']
united_costs = united_costs[columns_to_keep]
united_costs
united_costs['DATE'] = united_costs.apply(lambda r: datetime(int(r['YEAR']), int(r['MONTH']), 1), axis=1)
plot_time_series(united_costs['DATE'], united_costs['TOTAL_COST'], 'Costs Over Time for United', 'Date', 'Costs ($)')
To complete the analysis, we will now look at the price of oil products via the oil.csv file:
oil_prices = pd.read_csv('./oil.csv')
oil_prices['DATE'] = pd.to_datetime(oil_prices['DATE'])
oil_prices
From this, we can see that we have the prices for 3 different oil products at the beginning of every month. Let's plot them to see the difference.
plot_time_series(oil_prices['DATE'], oil_prices['NY_GASOLINE_PRICE'], 'NY Gasoline Price Over Time', 'Date', 'Price ($)')
plot_time_series(oil_prices['DATE'], oil_prices['US_GASOLINE_PRICE'], 'US Gasoline Price Over Time', 'Date', 'Price ($)')
plot_time_series(oil_prices['DATE'], oil_prices['KEROSENE_PRICE'], 'Kerosene Price Over Time', 'Date', 'Price ($)')
From this, we can see that all three series are very highly correlated, with minor variations. To our delight, they are also very highly correlated with the airline costs we plotted earlier. For the purposes of this study, it should be sufficient to choose one of them.
Which one should you pick?
Answer. Kerosene is what airplanes burn to fly, so it should be the focus. It is always important to choose variables that have as clear a causal relation as possible to what we are predicting. This use of domain knowledge is crucial to avoid mining for correlations rather than causation.
columns_to_keep = ['YEAR', 'MONTH', 'DATE', 'KEROSENE_PRICE']
oil_prices = oil_prices[columns_to_keep]
oil_prices
In general, but especially in the case of time series, it's important to remember how and when the data is collected. In this case, we have 3 time series: revenues, costs, and oil prices. As we know that profit is merely a function of costs, it would be tempting to throw away oil prices. Why do we care about the price of an input into cost when we have the actual cost? The answer lies in the delay between when the data is collected and what is actually happening at any given time. For this case: oil prices are published essentially in real time; costs are reported with a lag of a couple of months; and revenues are only filed quarterly.
How do these pieces fit together with respect to our goal, and what information are we missing?
Answer. Our goal is to predict revenue. However, all we have is information on past costs. This might seem useless to us, but costs track revenues closely. If we make a big assumption (remember your assumptions) that revenue per passenger and cost per passenger are roughly constant in the short term, then cost is a good predictor of revenue. So our initial model, given our existing data, looks something like:
$$\text{Revenue} = f( \text{Past Costs} )$$ But this does not take anything in the present into account. We also know that revenue is directly tied to how many people fly on the airline. Given our assumption, if we could incorporate the number of passengers, this would be a much better model:
$$\text{Revenue} = f( \text{Past Costs}, \text{Number of passengers} )$$ This guides us to supplement our current data by finding a good measure of the current number of passengers on an airline.
Now that we know we want a measure of the likely number of passengers on an airline, it would be good to find a dataset of historical airline ridership. After scouring the Internet, we find the perfect data source from the Bureau of Transportation Statistics. Namely, the T-100 Domestic Market (U.S. Carriers) dataset contains, among other fields, the carrier (UNIQUE_CARRIER), the origin and destination airports (ORIGIN_AIRPORT_ID, DEST_AIRPORT_ID), the reporting period (YEAR, QUARTER, MONTH), and the monthly passenger count (PASSENGERS).
The data aggregates monthly ridership, grouped by airline as well as by origin and destination airports. That is, every row represents all the passengers that flew between two cities in a given month on a given airline. Looking at the data you already have, you decide to use only the information since 2012, as this is the first full year United operated in the form we know it today. The raw data is available in airline_passengers.csv. Let's investigate what this dataset looks like:
airline_passengers = pd.read_csv('./airline_passengers.csv')
airline_passengers
Let's now visualize this data. It would be nice to visualize everything on a map to confirm that the data is accurate; the DataFrame shown above would be a nightmare to analyze without further transformation.
The first thing we notice is that there is a lot of data here, and it would be nice to know which airports those IDs correspond to. As we are only interested in United, filtering to that carrier should eliminate many of the rows. For the airports, you also found this on the BTS website documentation, which will help us figure out which airports are which. In addition, it gives the LAT and LON (latitude and longitude) of each airport, which will help us with map visualizations:
united_passengers = airline_passengers[airline_passengers['UNIQUE_CARRIER'] == 'UA']
united_passengers
airports = pd.read_csv('./airports.csv')
airports
As we can see, there are many different airports, and many of them are quite small with unknown locations (e.g. "Unknown Point in Alaska"). We also see that some airports have NaNs listed for their location. Let's try filtering out the airports that United does not fly to and hope that the remaining rows all have their locations listed. First, we need the unique airports:
unique_airports = set(united_passengers['ORIGIN_AIRPORT_ID'].unique())
unique_airports.update(united_passengers['DEST_AIRPORT_ID'].unique())
len(unique_airports)
Now, let's restrict the airports DataFrame to those that United flies to and see if any of them are null:
united_airports = airports[airports['Code'].isin(list(unique_airports))]
united_airports[united_airports['LAT'].isnull()]
Unfortunately, many of them are null and some of them are large airports (George Bush Intercontinental in Houston is a big hub). What do we do?
It turns out that we can use the GeoPy package to look up locations given addresses. (In general, this is good practice: when you have a problem with your data, be resourceful about searching online for something that can solve it. Don't reinvent the wheel, but don't assume there is nothing you can do, either.)
The bad part is that the addresses in the DataFrame are not really standardized. It would be much easier if we had the international 3-letter IATA code for each airport. Fortunately for us, the BTS provides it here. Let's turn this HTML table into a pandas DataFrame:
airport_codes = pd.read_html('https://www.bts.gov/topics/airlines-and-airports/world-airport-codes')[0]
airport_codes
Now we can easily merge this and find the locations using GeoPy as follows:
geolocator = Nominatim(user_agent="Airport Lat/Long Finder", timeout=100)
geolocator.geocode('IAH airport TX')
First, merge the united_airports DataFrame with the IATA codes DataFrame. Then using GeoPy, write a function to assign the missing latitudes and longitudes of the airports that United flies to.
Answer. One possible solution is shown below:
united_airports = united_airports.merge(airport_codes, left_on='Description', right_on='City: Airport', suffixes=('_BTS', '_IATA'))
united_airports
def get_airport_lat_long_by_iata_code(code, state):
    geolocator = Nominatim(user_agent="Airport Lat/Long Finder", timeout=100)
    query = code + ' airport ' + state
    # pause to respect the geocoder's rate limit and prevent crawling errors
    time.sleep(1)
    try:
        lat_long = geolocator.geocode(query)
        if lat_long:
            return str(lat_long.latitude), str(lat_long.longitude)
        return None, None
    except Exception:
        print('Error reading ' + query)
        return None, None
lat_lons = united_airports[united_airports['LAT'].isnull()].apply(lambda r: get_airport_lat_long_by_iata_code(r['Code_IATA'], r['Description'].split(':')[0].split(',')[-1]), axis=1)
for i in lat_lons.index:
    united_airports.loc[i, 'LAT'] = lat_lons.loc[i][0]
    united_airports.loc[i, 'LON'] = lat_lons.loc[i][1]
united_airports
united_paths = united_passengers.groupby(['ORIGIN_AIRPORT_ID', 'DEST_AIRPORT_ID', 'UNIQUE_CARRIER', 'YEAR', 'MONTH'])['PASSENGERS'].sum().reset_index()
united_paths = united_paths.merge(airports, left_on='ORIGIN_AIRPORT_ID', right_on='Code')
united_paths = united_paths.merge(airports, left_on='DEST_AIRPORT_ID', right_on='Code', suffixes=("_ORIGIN", "_DEST"))
united_paths
def plot_flight_paths(paths, airports):
    fig = go.Figure()
    tmp = paths.groupby(['LAT_ORIGIN', 'LON_ORIGIN', 'LAT_DEST', 'LON_DEST']).sum().reset_index()
    max_passengers = float(tmp['PASSENGERS'].max())
    for index, path in tmp.iterrows():
        fig.add_trace(
            go.Scattergeo(
                locationmode = 'USA-states',
                lon = [path['LON_ORIGIN'], path['LON_DEST']],
                lat = [path['LAT_ORIGIN'], path['LAT_DEST']],
                mode = 'lines',
                hoverinfo = "none",
                line = dict(width = 2, color = 'rgb(0, 93, 170)'),
                opacity = path['PASSENGERS'] / max_passengers,
            )
        )
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = airports['LON'],
        lat = airports['LAT'],
        hoverinfo = 'text',
        text = airports['Description'],
        mode = 'markers',
        marker = dict(
            size = 2,
            color = 'rgb(255, 0, 0)',
            line = dict(
                width = 3,
                color = 'rgba(68, 68, 68, 0)'
            )
        )))
    fig.update_layout(
        title_text = 'Flights by United',
        font = {'size': 36},
        showlegend = False,
        geo = go.layout.Geo(
            scope = 'north america',
            projection_type = 'kavrayskiy7',
            showland = True,
            showlakes = True,
            showcountries = True,
            landcolor = 'rgb(243, 243, 243)',
            countrycolor = 'rgb(204, 204, 204)',
        ),
    )
    fig.show()
plot_flight_paths(united_paths[united_paths['YEAR'] == 2012], united_airports)
What is Hawaii doing in Hong Kong's location? In this case, reading the documentation on the BTS website reaffirms that it is simply a wrong latitude and longitude. Let's correct those for the LIH flights:
def replace_lat_lon(r, lat_or_lon, origin_or_dest):
    """
    lat_or_lon: LAT or LON
    origin_or_dest: ORIGIN or DEST
    """
    LIH_code = 12982
    LIH_lat, LIH_lon = 21.976111, -159.338889
    copy_column = lat_or_lon + '_' + origin_or_dest
    code_column = 'Code_' + origin_or_dest
    if r[code_column] == LIH_code:
        return LIH_lat if lat_or_lon == 'LAT' else LIH_lon
    return r[copy_column]
united_paths['LAT_ORIGIN'] = united_paths.apply(lambda r: replace_lat_lon(r, 'LAT', "ORIGIN"), axis=1)
united_paths['LON_ORIGIN'] = united_paths.apply(lambda r: replace_lat_lon(r, 'LON', "ORIGIN"), axis=1)
united_paths['LAT_DEST'] = united_paths.apply(lambda r: replace_lat_lon(r, 'LAT', "DEST"), axis=1)
united_paths['LON_DEST'] = united_paths.apply(lambda r: replace_lat_lon(r, 'LON', "DEST"), axis=1)
Alter the plotting function above to take in the year and compare the United flights from 2012, 2015, and 2019. What are some trends that you notice?
Answer. As expected, we see a general increase in the number of passengers over time. One interesting thing we notice is the cluster of flights in the South Pacific, due to the monopoly United has on flights to Guam. We can also see individual flight paths being removed and added (e.g. the one to Puerto Rico). However, the overall major trend is the increase in flights to the South, such as Texas and Florida. This makes sense because the NYC-LA corridor has been quite busy for a very long time, with little room for growth:
def plot_flight_paths(paths, airports, year):
    fig = go.Figure()
    tmp = paths.groupby(['LAT_ORIGIN', 'LON_ORIGIN', 'LAT_DEST', 'LON_DEST', 'YEAR']).sum().reset_index()
    max_passengers = float(tmp['PASSENGERS'].max())
    tmp = tmp[tmp['YEAR'] == year]
    for index, path in tmp.iterrows():
        fig.add_trace(
            go.Scattergeo(
                locationmode = 'USA-states',
                lon = [path['LON_ORIGIN'], path['LON_DEST']],
                lat = [path['LAT_ORIGIN'], path['LAT_DEST']],
                mode = 'lines',
                hoverinfo = "none",
                line = dict(width = 2, color = 'rgb(0, 93, 170)'),
                opacity = path['PASSENGERS']**1.5 / max_passengers**1.5,
            )
        )
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = airports['LON'],
        lat = airports['LAT'],
        hoverinfo = 'text',
        text = airports['Description'],
        mode = 'markers',
        marker = dict(
            size = 2,
            color = 'rgb(255, 0, 0)',
            line = dict(
                width = 3,
                color = 'rgba(68, 68, 68, 0)'
            )
        )))
    fig.update_layout(
        title_text = 'Flights by United',
        font = {'size': 36},
        showlegend = False,
        geo = go.layout.Geo(
            scope = 'north america',
            projection_type = 'kavrayskiy7',
            showland = True,
            showlakes = True,
            showcountries = True,
            landcolor = 'rgb(243, 243, 243)',
            countrycolor = 'rgb(204, 204, 204)',
        ),
    )
    fig.show()
plot_flight_paths(united_paths, united_airports, 2012)
plot_flight_paths(united_paths, united_airports, 2015)
plot_flight_paths(united_paths, united_airports, 2019)
This completes our basic EDA on the BTS data. The point that should be emphasized is that this EDA was used solely as a means to validate the data; we are not yet extracting hypotheses about revenue from the passenger data. Recall that we don't have this information until 3 months after the fact. Before relying on it, though, we should make sure that it is useful information to have.
When dealing with time series, it's important to account for trends. If two series are each dominated by a trend, their individual fluctuations are drowned out, and this may lead to spurious conclusions. Consider the following example of two random series:
def plot_series_with_slope(m):
    n = 100
    t = np.linspace(0, n, n)
    X_1 = m * t + np.random.normal(size=n)
    X_2 = m * t + np.random.normal(size=n)
    sns.lineplot(x=t, y=X_1)
    sns.lineplot(x=t, y=X_2)

def get_corr_with_slope(m):
    n = 100
    t = np.linspace(0, 10 * n, n)
    X_1 = m * t + np.random.normal(size=n)
    X_2 = m * t + np.random.normal(size=n)
    return np.corrcoef(X_1, X_2)
plot_series_with_slope(0)
To the naked eye, these series look completely unrelated. Measuring their correlation confirms this:
get_corr_with_slope(0)
However, what happens if we increase the slope between them?:
plot_series_with_slope(0.1)
get_corr_with_slope(0.1)
As you can see, the correlation is almost perfect. To give you a sense of how fast the trend can overpower the relationship, here is the correlation as a function of the size of the slope:
slopes = np.logspace(-3, 1, 100)
corrs = []
for m in slopes:
    corrs.append(get_corr_with_slope(m)[0, 1])
f, ax = plt.subplots()
ax.set(xscale='log')
sns.lineplot(x=slopes, y=corrs)
In light of this fact, we need to remove the trend from these series. This is easily done by correlating the differences between consecutive points rather than the points themselves:
m = 1
n = 1000
t = np.linspace(0, 10*n, n)
X_1 = pd.Series(m*t + np.random.normal(size=n))
X_2 = pd.Series(m*t + np.random.normal(size=n))
def time_series_corr(X_1, X_2):
    diff_1 = X_1[1:] - X_1.shift()[1:]
    diff_2 = X_2[1:] - X_2.shift()[1:]
    return np.corrcoef(diff_1, diff_2)
time_series_corr(X_1, X_2)
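As a standalone demonstration of why this matters (using pandas' diff(), which is equivalent to the shift-and-subtract above): two independent series that share a strong trend look almost perfectly correlated in levels, but their first differences are essentially uncorrelated:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000
t = np.linspace(0, 10 * n, n)
X_1 = pd.Series(t + np.random.normal(size=n))  # independent noise on a shared trend
X_2 = pd.Series(t + np.random.normal(size=n))

raw_corr = np.corrcoef(X_1, X_2)[0, 1]  # dominated by the shared trend
diff_corr = np.corrcoef(X_1.diff().dropna(), X_2.diff().dropna())[0, 1]
print(round(raw_corr, 4), round(diff_corr, 4))  # raw is near 1, differenced is near 0
```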
Now that we can correlate two time series, let's check how passengers and revenues correlate. The first thing we need is to merge the revenues and passengers DataFrames. As it stands, our passengers DataFrame has a row for each origin-destination pair for every month, so we need to aggregate these numbers by month. However, revenues only come out every quarter, so we will also add the quarter to the passengers DataFrame:
united_passengers_by_month = united_passengers.groupby(['YEAR', 'MONTH']).sum()['PASSENGERS'].reset_index()
united_passengers_by_month['DATE'] = united_passengers_by_month.apply(lambda r: datetime(int(r['YEAR']), int(r['MONTH']), 1), axis=1)
united_passengers_by_month
plot_time_series(united_passengers_by_month['DATE'], united_passengers_by_month['PASSENGERS'], 'United Passengers Over Time', 'Date', 'Passengers')
plot_time_series(united_revenues['DATE'], united_revenues['REVENUE'], 'United Revenues Over Time', 'Date', 'Revenues ($)')
Before we can correlate these series, we need to join on the quarter so that the DataFrames match in granularity:
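A toy sketch of this kind of granularity-matching merge, with illustrative values rather than the real data: an inner join on YEAR and QUARTER keeps only the periods present in both frames, so the two series line up row for row:

```python
import pandas as pd

revenues = pd.DataFrame({'YEAR': [2018, 2018], 'QUARTER': [1, 2],
                         'REVENUE': [9.0, 11.0]})
passengers = pd.DataFrame({'YEAR': [2018, 2018, 2018], 'QUARTER': [1, 2, 3],
                           'PASSENGERS': [100, 120, 130]})

# inner join (the pandas default) drops quarter 3, which has no matching revenue row
merged = pd.merge(revenues, passengers, on=['YEAR', 'QUARTER'])
print(merged['PASSENGERS'].tolist())  # [100, 120]
```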
united_passengers_by_quarter = united_passengers.groupby(['YEAR', 'QUARTER']).sum()['PASSENGERS'].reset_index()
united_passengers_by_quarter
# map the filing month to its calendar quarter (1-4)
united_revenues['QUARTER'] = united_revenues.apply(lambda r: (int(r['MONTH']) - 1) // 3 + 1, axis=1)
merged_df = pd.merge(united_revenues, united_passengers_by_quarter, on=['YEAR', 'QUARTER'])
time_series_corr(merged_df['REVENUE'], merged_df['PASSENGERS'])
From this, we can see that passengers are very highly correlated with revenues. Thus, knowing the number of passengers provides a high-quality signal for predicting revenues. However, we still have the issue of lagged information. Can you think of some data sources that might give us information up to the present?
To deal with the lag issue, we will try to predict the current number of passengers using Twitter. Specifically, we hypothesize that the number of tweets in a given month is correlated with that month's passengers. United's Twitter data can be found in united_tweets.csv and has two simple features: DATE and COUNT (the number of tweets on that day).
united_tweets = pd.read_csv('./united_tweets.csv')
united_tweets
Now we check if these tweets are correlated with passengers by aggregating the tweets by month:
united_tweets['DATE'] = pd.to_datetime(united_tweets['DATE'])
united_tweets['YEAR'] = united_tweets['DATE'].dt.year
united_tweets['MONTH'] = united_tweets['DATE'].dt.month
united_tweets_by_month = united_tweets.groupby(['YEAR', 'MONTH']).sum().reset_index()
united_tweets_by_month
time_series_corr(united_tweets_by_month['COUNT'], united_passengers_by_month['PASSENGERS'])
Unfortunately, this is an incredibly disappointing result. Let's see if we can figure out what's going on:
plot_time_series(united_tweets['DATE'], united_tweets['COUNT'], 'United Tweets Over Time', 'Date', 'Number of Tweets')
That is an incredibly large anomaly. Let's check it out.
Answer. Shown below:
united_tweets[united_tweets['COUNT'] > 300000]
A quick Google check will show that United had a notorious incident on April 10, 2017.
How should we handle the anomaly?
Answer. We should perform some type of smoothing over the time series. There are many types of smoothing we can do: a simple moving average, a weighted or exponential moving average, or a rolling median, among others. These are all excellent options, but we will need to objectively measure which one is best.
Before we start feature engineering, we need a way to quantify the usefulness of what we are trying to accomplish. In this case, we are looking for a signal that is highly correlated with United's passenger data. Even though our goal is to predict revenues, the big-picture idea is to use passenger data to predict revenues and then use Twitter data to predict the most recent passenger data. Thus, we will use our time_series_corr() function on each engineered feature and the monthly passenger data to quantify its usefulness.
First, we'll start with a simple moving average over the last thirty days, using the pandas rolling() function. rolling() is very similar to groupby(): it essentially "groups" the rows in a rolling-window fashion:
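A minimal sketch of the mechanics on a toy Series: each output is the mean of the current point and the points before it in the window, and the first n-1 outputs are NaN because the window is not yet full:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
print(s.rolling(3).mean().tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```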
n_days = 30
united_tweets['SMA'] = united_tweets['COUNT'].rolling(n_days).mean()
united_tweets
Let's plot it and assess its usefulness:
plot_time_series(united_tweets['DATE'].iloc[n_days:], united_tweets['SMA'].iloc[n_days:], 'SMA of United Tweets', 'Date', 'Number of Tweets')
# need to convert tweets to month
time_series_corr(united_tweets.iloc[n_days:].groupby(['YEAR', 'MONTH']).mean()['SMA'], united_passengers_by_month['PASSENGERS'])
The nice thing about the rolling() function is that it lets you apply different weights within the rolling window. For example, if you want to favor the most recent data points, all you have to do is supply the win_type argument (see the documentation link above for all of the different window types). Or, you can use the ewm() function if you want an exponential moving average:
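As a standalone sketch of the exponential weighting: with adjust=True (the pandas default), each output is a weighted sum of all points so far with weights (1 - alpha)^k, normalized to sum to one, so recent points count more:

```python
import pandas as pd

ema = pd.Series([1.0, 2.0, 3.0]).ewm(alpha=0.5).mean()
# e.g. the second value is (2 + 0.5 * 1) / (1 + 0.5) = 5/3
print(ema.tolist())  # [1.0, 1.666..., 2.428...]
```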
alpha = 0.5
united_tweets['EMA'] = united_tweets['COUNT'].ewm(alpha=alpha).mean()
plot_time_series(united_tweets['DATE'], united_tweets['EMA'], 'EMA of United Tweets', 'Date', 'Number of Tweets')
# need to convert tweets to month
time_series_corr(united_tweets.groupby(['YEAR', 'MONTH']).mean()['EMA'], united_passengers_by_month['PASSENGERS'])
As you can see, this is still disappointing. The root cause is the harsh spike in 2017. Is there an aggregation function that is more robust to big spikes? Yes! The median:
united_tweets['MMA'] = united_tweets['COUNT'].rolling(n_days).median()
plot_time_series(united_tweets['DATE'].iloc[n_days:], united_tweets['MMA'].iloc[n_days:], 'MMA of United Tweets', 'Date', 'Number of Tweets')
# need to convert tweets to month
time_series_corr(united_tweets.iloc[n_days:].groupby(['YEAR', 'MONTH']).mean()['MMA'], united_passengers_by_month['PASSENGERS'])
Now we are making some progress! But the spikes are still problematic. If we apply the median over a larger window, the spikes disappear:
n_days = 60
united_tweets['MMA'] = united_tweets['COUNT'].rolling(n_days).median()
plot_time_series(united_tweets['DATE'].iloc[n_days:], united_tweets['MMA'].iloc[n_days:], 'MMA of United Tweets', 'Date', 'Number of Tweets')
However, the problem is that we are capped by the size of the window, because beyond a point we start losing seasonal information. One easy thing we can do is take the monthly median of the rolling median:
n_days = 30
united_tweets['MMA'] = united_tweets['COUNT'].rolling(n_days).median()
# need to convert tweets to month
time_series_corr(united_tweets.iloc[n_days:].groupby(['YEAR', 'MONTH']).median()['MMA'], united_passengers_by_month['PASSENGERS'])
Again, some progress, but the spikes still need to be dealt with. What if we group by month first instead of doing a rolling window?:
united_tweets_by_month = united_tweets.groupby(['YEAR', 'MONTH'])['COUNT'].median().reset_index()
plot_time_series(united_passengers_by_month['DATE'], united_tweets_by_month['COUNT'], 'Monthly Median of United Tweets', 'Date', 'Number of Tweets')
time_series_corr(united_tweets_by_month['COUNT'], united_passengers_by_month['PASSENGERS'])
That's a step backwards, but still a valiant effort.
Try to engineer the tweets time series to remove the big spikes and beat the correlation of $0.2$ that the median rolling window yielded.
Answer. One possible solution is given below:
n_days = 30
# fit a linear trend to the raw counts; polyfit returns (slope, intercept)
slope, intercept = np.polyfit(united_tweets.index, united_tweets['COUNT'], 1)
std = united_tweets['COUNT'].std()

def smooth_count(r):
    count = r['COUNT']
    t = r.name
    # clip any count that sits well above the fitted trend line
    if count > t * slope + intercept + 0.05 * std:
        return t * slope + intercept
    return count
united_tweets['SMOOTH_COUNT'] = united_tweets.apply(lambda r: smooth_count(r), axis=1)
plot_time_series(united_tweets['DATE'], united_tweets['SMOOTH_COUNT'], 'Smoothed United Tweets', 'Date', 'Number of Tweets')
# Now apply rolling median
united_tweets['SMOOTH_SMA'] = united_tweets['SMOOTH_COUNT'].rolling(n_days).mean()
plot_time_series(united_tweets['DATE'].iloc[n_days:], united_tweets['SMOOTH_SMA'].iloc[n_days:], 'Smooth SMA of United Tweets', 'Date', 'Number of Tweets')
# need to convert tweets to month
united_tweets_by_month = united_tweets.iloc[n_days:].groupby(['YEAR', 'MONTH']).mean()['SMOOTH_SMA'].reset_index()
time_series_corr(united_tweets_by_month['SMOOTH_SMA'], united_passengers_by_month['PASSENGERS'])
A correlation of $0.29$, not bad!
Now that we have all of our data, let's summarize what we have:
Independent variables (predictors):
- united_passengers_by_month
- united_costs
- oil_prices
- united_tweets_by_month

Target variable:
- united_revenues
Everything is a time series, so we need to decide how much history we want to feed into the algorithm. Having more history should help with predictions, but most of the useful information is in the recent past. We also have to decide which variables to include. Here are two examples:
$Revenue_i = f(Passengers_{i-3}, Costs_{i-2}, Kerosene_i, Tweets_i)$
This represents using only the latest available passengers, costs, and tweets. On the other end of the spectrum, we could have something like:
$Revenue_i = f(Passengers_{i-3}, Passengers_{i-4}, \ldots, Passengers_{i-14}, Costs_{i-2}, Costs_{i-3}, \ldots, Costs_{i-14}, Kerosene_{i}, Kerosene_{i-1}, \ldots, Kerosene_{i-11}, Tweets_{i}, Tweets_{i-1}, \ldots, Tweets_{i-11})$
which includes the last 12 months of data for every time series we have. Let's evaluate both of these models. First, we need to introduce lagged variables into our data by using the pandas shift() function:
oil_prices[['KEROSENE_PRICE']]
oil_prices[['KEROSENE_PRICE']].shift(1)
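Building a whole block of lag columns is just shift() in a loop followed by dropna(); a standalone sketch with illustrative values (make_lag_dfs() below does this for each series):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], name='KEROSENE_PRICE')
X = pd.DataFrame({'KEROSENE_PRICE_0': s})
for i in range(1, 3):
    X[f'KEROSENE_PRICE_{i}'] = s.shift(i)  # lag-i copy of the series

# rows without a full lag history contain NaNs and are unusable
X = X.dropna()
print(X.iloc[0].tolist())  # [30.0, 20.0, 10.0]
```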
Finally, we have to merge these lagged DataFrames with our target variable and make dummy variables for the quarter of the revenue filing (via the MONTH column, since only quarter-end months survive the merge). Dummy variables are used here because the quarters aren't ordinal. That is, if the effect on revenue of the first quarter is $\beta$, the effect of the fourth quarter will not be $4\beta$:
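A standalone sketch of what get_dummies() produces for the month column (using quarter-end months as illustrative values):

```python
import pandas as pd

df = pd.DataFrame({'MONTH': [3, 6, 9, 12], 'REVENUE': [1.0, 2.0, 3.0, 4.0]})
dummies = pd.get_dummies(df, columns=['MONTH'])
# MONTH is replaced by one indicator column per distinct value
print(dummies.columns.tolist())  # ['REVENUE', 'MONTH_3', 'MONTH_6', 'MONTH_9', 'MONTH_12']
```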
def make_lag_dfs(num_oil_lag, num_passengers_lag, num_costs_lag, num_tweets_lag):
    start_date = datetime(2012, 1, 1)
    end_date = datetime(2020, 1, 1)
    X = oil_prices[(oil_prices['DATE'] >= start_date) & (oil_prices['DATE'] <= end_date)]

    # Oil price lags (lag 0 is the current month)
    for i in range(num_oil_lag):
        X[f"KEROSENE_PRICE_{i}"] = X['KEROSENE_PRICE'].shift(i)
    del X['KEROSENE_PRICE']

    # Passenger lags, starting 3 months back
    columns_to_keep = ['YEAR', 'MONTH', 'PASSENGERS']
    X = X.merge(united_passengers_by_month[
        (united_passengers_by_month['DATE'] >= start_date) &
        (united_passengers_by_month['DATE'] <= end_date)][columns_to_keep],
        on=['YEAR', 'MONTH'], how='outer')
    for i in range(3, 3 + num_passengers_lag):
        X[f"PASSENGERS_{i}"] = X['PASSENGERS'].shift(i)
    del X['PASSENGERS']

    # Cost lags, also starting 3 months back
    columns_to_keep = ['YEAR', 'MONTH', 'TOTAL_COST']
    X = X.merge(united_costs[(united_costs['DATE'] >= start_date) &
                             (united_costs['DATE'] <= end_date)][columns_to_keep],
                on=['YEAR', 'MONTH'], how='outer')
    for i in range(3, 3 + num_costs_lag):
        X[f"TOTAL_COST_{i}"] = X['TOTAL_COST'].shift(i)
    del X['TOTAL_COST']

    # Tweet sentiment lags (smoothed moving average)
    columns_to_keep = ['YEAR', 'MONTH', 'SMOOTH_SMA']
    X = X.merge(united_tweets_by_month[columns_to_keep], on=['YEAR', 'MONTH'], how='outer')
    for i in range(num_tweets_lag):
        X[f"TWEETS_{i}"] = X['SMOOTH_SMA'].shift(i)
    del X['SMOOTH_SMA']

    # Attach the target variable and drop rows with missing lags
    columns_to_keep = ['YEAR', 'MONTH', 'REVENUE']
    X = pd.merge(X, united_revenues[columns_to_keep], on=['YEAR', 'MONTH'])
    X = X.dropna()

    # Cast to int in case the outer merges promoted the key columns to floats
    dates = X.apply(lambda r: datetime(int(r['YEAR']), int(r['MONTH']), 30), axis=1)
    del X['DATE']
    del X['YEAR']

    # One-hot encode the filing month (categorical, not ordinal)
    X = pd.get_dummies(X, columns=['MONTH'])
    y = X['REVENUE']
    del X['REVENUE']
    return X, y, dates
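The lagging pattern above relies on `Series.shift`, which moves values down so that row $t$ sees the value from row $t-i$. A minimal sketch on a toy price series:

```python
import pandas as pd

df = pd.DataFrame({'PRICE': [10.0, 11.0, 12.0, 13.0]})
for i in range(3):
    # Lag i: row t receives the value from row t-i
    df[f'PRICE_{i}'] = df['PRICE'].shift(i)
print(df)
```

The last row ends up with `PRICE_0=13.0`, `PRICE_1=12.0`, `PRICE_2=11.0`, while the earliest rows contain NaNs for the longer lags, which is exactly why `make_lag_dfs()` ends with a `dropna()` and why longer lags cost us data points.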
X_minimal, y_minimal, dates_minimal = make_lag_dfs(1, 1, 1, 1)
X_maximal, y_maximal, dates_maximal = make_lag_dfs(12, 12, 12, 12)
X_minimal.head()
X_maximal.head()
As we can see, the price we pay for 12 months of lagged history is 3 fewer data points, which is significant when we only have about 30 data points overall. To assess performance, we will train on 2012 through 2018 and predict the last 3 data points in 2019, starting with a simple linear model. Note that the month variable has already been converted into dummy variables inside make_lag_dfs(), since it should not be treated as ordinal:
dates_train = dates_minimal[:-3]
dates_test = dates_minimal[-3:]
X_train, y_train = X_minimal[:-3], y_minimal[:-3]
X_test, y_test = X_minimal[-3:], y_minimal[-3:]
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)
pred = clf.predict(X_train)
mean_error = (abs(pred-y_train)/y_train).mean()
standard_deviation = (abs(pred-y_train)/y_train).std()
print('Training Error:')
print(f'Mean Error: {mean_error}')
print(f'Standard Deviation of Error: {standard_deviation}')
print('Test Error:')
pred = clf.predict(X_test)
mean_error = (abs(pred-y_test)/y_test).mean()
standard_deviation = (abs(pred-y_test)/y_test).std()
print(f'Mean Error: {mean_error}')
print(f'Standard Deviation of Error: {standard_deviation}')
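The error block above computes the mean absolute percentage error and its standard deviation. Since the same block is repeated for every model in this case, it could be factored into a small helper (`report_errors` is a hypothetical name, not something defined in the case):

```python
import numpy as np

def report_errors(pred, truth, label):
    """Print the mean and standard deviation of the absolute percentage error."""
    rel = np.abs(np.asarray(pred) - np.asarray(truth)) / np.asarray(truth)
    print(f'{label}:')
    print(f'Mean Error: {rel.mean()}')
    print(f'Standard Deviation of Error: {rel.std()}')
    return rel.mean(), rel.std()

# Toy check: predictions off by +/-10% give a 10% mean error
mean, sd = report_errors([9.0, 11.0], [10.0, 10.0], 'Toy Example')
# mean == 0.1, sd == 0.0
```

(Note that `np.std` uses the population denominator, while the pandas `.std()` used above divides by $n-1$; for reporting purposes either convention is fine as long as it is used consistently.)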
The numbers don't lie, but a plot makes the behavior much easier to see. Let's visualize the minimal model's predictions, then fit the same linear model on the maximal feature set, where the overfitting becomes obvious:
def plot_prediction_time_series(dates_train, train, dates_test, pred,
                                dates_truth, truth, title, x_label, y_label):
    years_locator = mdates.YearLocator()
    months_locator = mdates.MonthLocator()
    years_format = mdates.DateFormatter('%Y')

    fig, ax = plt.subplots()
    fig.set_size_inches(11.7, 8.27)
    ax.ticklabel_format(axis='y', style='plain')
    ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
    ax.xaxis.set_major_locator(years_locator)
    ax.xaxis.set_major_formatter(years_format)
    ax.xaxis.set_minor_locator(months_locator)

    # True revenue series, fitted values over the training period, and the
    # out-of-sample predictions (joined to the last training point for continuity)
    sns.lineplot(x=dates_truth, y=truth, marker='o')
    sns.lineplot(x=dates_train, y=train, color='orange', marker='o')
    dates_pred = list(dates_train[-1:]) + list(dates_test)
    plot_pred = list(train[-1:]) + list(pred)
    sns.lineplot(x=dates_pred, y=plot_pred, color='black', marker='o')

    plt.legend(labels=['United\'s Revenue', 'Training Samples', 'Predictions'], fontsize='14')
    plt.setp(ax.get_xticklabels(), rotation=45)
    plt.title(title, fontsize='18')
    plt.xlabel(x_label, fontsize='16')
    plt.ylabel(y_label, fontsize='16')
    plt.show()
plot_prediction_time_series(dates_train, clf.predict(X_train), dates_test, pred, united_revenues[-30:]['DATE'], united_revenues[-30:]['REVENUE'], 'Predicted United Revenues', 'Date', 'Revenue ($ Million)')
dates_train = dates_maximal[:-3]
dates_test = dates_maximal[-3:]
X_train, y_train = X_maximal[:-3], y_maximal[:-3]
X_test, y_test = X_maximal[-3:], y_maximal[-3:]
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)
pred = clf.predict(X_train)
mean_error = (abs(pred-y_train)/y_train).mean()
standard_deviation = (abs(pred-y_train)/y_train).std()
print('Training Error:')
print(f'Mean Error: {mean_error}')
print(f'Standard Deviation of Error: {standard_deviation}')
print('Test Error:')
pred = clf.predict(X_test)
mean_error = (abs(pred-y_test)/y_test).mean()
standard_deviation = (abs(pred-y_test)/y_test).std()
print(f'Mean Error: {mean_error}')
print(f'Standard Deviation of Error: {standard_deviation}')
plot_prediction_time_series(dates_train, clf.predict(X_train), dates_test, pred, united_revenues[-30:]['DATE'], united_revenues[-30:]['REVENUE'], 'Predicted United Revenues', 'Date', 'Revenue ($ Million)')
As you can see, throwing all the variables at the problem makes the model overfit: it tracks the training data almost perfectly but generalizes poorly. Our baseline for model performance, then, is something like 3% error on the test data. There is still much work that could be done here, such as considering interaction terms and other models.
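One way to generate the interaction terms mentioned above is scikit-learn's `PolynomialFeatures` with `interaction_only=True`. A sketch on a toy two-column matrix (the data is illustrative, not from the case):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)
print(X_inter)
# Columns are x0, x1, and the product x0*x1:
# [[ 1.  2.  2.]
#  [ 3.  4. 12.]]
```

With only ~30 samples, adding many interaction columns would make the overfitting problem worse, so this should be paired with regularization such as the LASSO used below.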
Come up with your best model for predicting future revenues. Don't forget that you can use the make_lag_dfs() function to build a DataFrame with the lagged variables and the quarter dummy variables.
Answer. Continuing with the theme of overfitting, here is a model that performs atrociously:
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import StandardScaler

dates_train = dates_maximal[:-3]
dates_test = dates_maximal[-3:]
X_train, y_train = X_maximal[:-3], y_maximal[:-3]
X_test, y_test = X_maximal[-3:], y_maximal[-3:]

# Scale features first; RBF kernels are sensitive to feature magnitudes
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = GridSearchCV(KernelRidge(kernel='rbf'), param_grid={'alpha': np.logspace(0, -3, 10)})
clf.fit(X_train, y_train)
pred = clf.predict(X_train)
mean_error = (abs(pred-y_train)/y_train).mean()
standard_deviation = (abs(pred-y_train)/y_train).std()
print('Training Error:')
print(f'Mean Error: {mean_error}')
print(f'Standard Deviation of Error: {standard_deviation}')
print('Test Error:')
pred = clf.predict(X_test)
mean_error = (abs(pred-y_test)/y_test).mean()
standard_deviation = (abs(pred-y_test)/y_test).std()
print(f'Mean Error: {mean_error}')
print(f'Standard Deviation of Error: {standard_deviation}')
plot_prediction_time_series(dates_train, clf.predict(X_train), dates_test, pred, united_revenues[-30:]['DATE'], united_revenues[-30:]['REVENUE'], 'Predicted United Revenues', 'Date', 'Revenue ($ Million)')
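One caveat with the grid searches in this case: `GridSearchCV`'s default cross-validation folds ignore temporal ordering, so a fold can "peek" at the future when scoring a candidate `alpha`. scikit-learn's `TimeSeriesSplit` produces folds where each test window strictly follows its training window; a standalone sketch with a dummy feature matrix:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Each fold trains only on observations that precede its test window
    print(train_idx, test_idx)
```

This could be plugged into the search as `GridSearchCV(..., cv=TimeSeriesSplit(n_splits=3))`, though with only ~30 quarterly samples the folds become very small.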
For automated feature selection, we can of course opt for LASSO regression with a grid search over the regularization strength:
X, y, dates = make_lag_dfs(0,12,12,12)
dates_train = dates[:-3]
dates_test = dates[-3:]
X_train, y_train = X[:-3], y[:-3]
X_test, y_test = X[-3:], y[-3:]
clf = GridSearchCV(linear_model.Lasso(), param_grid={'alpha': np.logspace(-3, 3, 10)})
clf.fit(X_train, y_train)
pred = clf.predict(X_train)
mean_error = (abs(pred-y_train)/y_train).mean()
standard_deviation = (abs(pred-y_train)/y_train).std()
print('Training Error:')
print(f'Mean Error: {mean_error}')
print(f'Standard Deviation of Error: {standard_deviation}')
print('Test Error:')
pred = clf.predict(X_test)
mean_error = (abs(pred-y_test)/y_test).mean()
standard_deviation = (abs(pred-y_test)/y_test).std()
print(f'Mean Error: {mean_error}')
print(f'Standard Deviation of Error: {standard_deviation}')
plot_prediction_time_series(dates_train, clf.predict(X_train), dates_test, pred, united_revenues[-30:]['DATE'], united_revenues[-30:]['REVENUE'], 'Predicted United Revenues', 'Date', 'Revenue ($ Million)')
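LASSO performs feature selection by driving uninformative coefficients to exactly zero. A self-contained sketch on synthetic data, where the true model uses only the first two of five features:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
# Only features 0 and 1 carry signal; the rest are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

clf = Lasso(alpha=0.1).fit(X, y)
print(clf.coef_)
# The three irrelevant features get coefficients at or near zero
```

For the fitted grid search above, inspecting `clf.best_estimator_.coef_` shows which of the lagged features survived the penalty.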
Finally, this case study wouldn't be complete without validating the actual revenue for Q4 of 2019, which was $10.888 billion:
new_row = pd.DataFrame([[2019, 12, 31, 10888, datetime(2019, 12, 31), 4]],
                       columns=united_revenues.columns)
tmp = united_revenues.copy()
united_revenues = pd.concat([united_revenues, new_row], ignore_index=True)
X, y, dates = make_lag_dfs(0, 12, 12, 12)
united_revenues = tmp.copy()  # restore the original DataFrame
print(f"Final prediction for Q4 2019 is: {clf.predict(X.iloc[-1:])[0]} vs. an expected 10888")
That represents a roughly 2.2% error; not bad for a model with only 30 training examples!
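The 2.2% figure is the absolute percentage error of the final prediction. For example, a hypothetical prediction of $10,650M against the actual $10,888M (the prediction value here is illustrative, not the model's actual output) gives:

```python
actual = 10888.0
pred = 10650.0  # hypothetical value for illustration only
pct_error = abs(pred - actual) / actual * 100
print(f'{pct_error:.1f}% error')
```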
In this case, we did a data science project end-to-end: collecting and cleaning data, visualizing it to validate it, engineering new features, and finally building a model to predict airline revenue. We encountered common problems at each step.
At each stage, we overcame the issue by not losing sight of the goal and keeping in mind the big picture. By understanding our problem and its real-world influences, we were able to achieve a better result in the end.
Data science is much more than knowing the latest methods and the tools of the trade. If you were asked what makes a good carpenter, you would not answer "someone who knows how to use a hammer". These are the points you should take away from this case:
Understand your problem first. If you don't have a clear objective, you'll be left wandering. Assess your data according to your goal.
Real data science requires trial and error. Even the best data sources make mistakes. You should constantly validate input and output.
Visualizing data is key to understanding. Imagine removing all the graphs from this case and working solely with metrics. While useful, metrics will never be as insightful as plots for understanding failure mechanisms.
Finally, the modeling step is arguably the easiest and least important step for project success: by sticking to simple linear models, we were able to achieve very good results. The data mattered more than the model; leaving out the Twitter data entirely could have severely compromised our outcomes.